Add to_batches() and interpolate() methods to DataFrame #1241

H0TB0X420 wants to merge 1 commit into apache:main
Conversation
- Add to_batches() as an alias for collect(), returning a list of RecordBatches
- Add interpolate() method with forward_fill support
- Add a deprecation warning to the collect() method
- Add comprehensive tests for both methods

Addresses items from RFC apache#875
```python
def to_batches(self) -> list[pa.RecordBatch]:
    """Convert DataFrame to list of RecordBatches."""
    return self.collect()  # delegate to existing method
```
My opinion (see #1227) is that we should limit the surface area where we explicitly depend on pyarrow. Especially in this case, where it's just an alias?
100% agree, this should return arro3 recordbatches
I'm not saying it should even return arro3 RecordBatches... because that's still an external dependency that datafusion is requiring on users. Datafusion could return a minimal batch object that just holds the RecordBatch pointer and then the user transfers it to their library of choice.
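The "minimal batch object" idea can be sketched without any Arrow library at all: a dependency-free wrapper that only delegates the Arrow PyCapsule protocol (`__arrow_c_array__`), so any Arrow-aware consumer (pyarrow, arro3, polars, nanoarrow) can import it. `MinimalBatch` and the inner `_native` handle are hypothetical names, not actual datafusion API:

```python
class MinimalBatch:
    """Hypothetical dependency-free holder for a native RecordBatch handle.

    Any library implementing the Arrow PyCapsule interface can consume
    it via __arrow_c_array__; datafusion itself imports nothing.
    """

    def __init__(self, native):
        # Hypothetical native object that owns the RecordBatch and can
        # export it through the Arrow C data interface.
        self._native = native

    def __arrow_c_array__(self, requested_schema=None):
        # Delegate the PyCapsule export to the native object.
        return self._native.__arrow_c_array__(requested_schema)
```

A consumer would then transfer the data into their library of choice, e.g. something like `pyarrow.record_batch(minimal_batch)` in recent pyarrow versions that accept PyCapsule-exporting objects.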
> I'm not saying it should even return arro3 RecordBatches... because that's still an external dependency that datafusion is requiring on users. Datafusion could return a minimal batch object that just holds the RecordBatch pointer and then the user transfers it to their library of choice.
That's possible, but I would argue it's more convenient if the result is already usable in some way, rather than just a pointer. Arro3 is small enough that its size is negligible overall.
Perhaps we should get @timsaucer 's thoughts in #1227: is having a required dependency on pyarrow a problem? What should we do about it? Do we want to depend on arro3-core instead? Or have functions that rely on pyarrow but error if pyarrow isn't installed?
@ion-elgreco feel free to write down your thoughts there too
I understand that pyarrow is a large dependency, but my sense from the community is that nearly everyone using datafusion-python for things like to_batches is also using pyarrow in the next portion of their code. I'm sure we can always find exceptions.
Now if there is a way we can remove this dependency and it doesn't break existing workflows, that would be even better. I haven't made the time to sit down and play with it, though.
> Now if there is a way we can remove this dependency and it doesn't break existing workflows
I think the straightforward way to do that is to remove pyarrow as a required dependency and error if it's not installed. But a separate question is whether we should be adding new methods that explicitly depend on pyarrow.
I could do a try-except to raise an ImportError if pyarrow is not installed. But given the dependency discussion, it might make more sense to drop to_batches() from this PR entirely. I don't want to add to the pyarrow dependency concerns.
I'll remove it and focus on fixing the interpolate() implementation. I appreciate you taking the time to give feedback!
timsaucer left a comment
Thank you for the PR! I understand getting your first PR in a new project can be daunting. I know I have some critical feedback here, but I do appreciate the effort.
```python
def interpolate(self, method: str = "forward_fill", **kwargs) -> DataFrame:
    """Interpolate missing values per column.

    Args:
        method: Interpolation method ('linear', 'forward_fill', 'backward_fill')
    """
```
At the outset this doesn't look quite right to me.
- The method is called
interpolatebut the one interpolation method is the one not supported. The others are filling operations, not interpolations. - This looks like it's going to create a very wrong filling - every field in the schema gets sorted and then filled? I would expect we would need an ordering column to determine the filling method.
- If we were to do this I think the most pleasant experience would have an enum for the possible values with the simple string conversions necessary.
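One possible shape for such an enum, sketched here with hypothetical names (`FillMethod` and `from_any` are illustrative, not existing datafusion-python API):

```python
from enum import Enum


class FillMethod(Enum):
    """Hypothetical enum for the supported filling strategies."""

    FORWARD = "forward_fill"
    BACKWARD = "backward_fill"

    @classmethod
    def from_any(cls, value) -> "FillMethod":
        """Accept either the enum itself or its simple string form."""
        if isinstance(value, cls):
            return value
        try:
            return cls(value)
        except ValueError:
            options = ", ".join(m.value for m in cls)
            raise ValueError(
                f"unknown fill method {value!r}; expected one of: {options}"
            ) from None
```

The public method would then take `method: str | FillMethod` and normalize it immediately with `FillMethod.from_any(method)`, so users get both type safety and a clear error listing the valid options.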
Two quick questions:
- Should we have separate methods for filling vs interpolation?
- Do we wait for DataFusion core features before adding Python APIs, or is it okay to create stand-in Python implementations using existing primitives?
Thank you for your time!
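For reference, the order-dependent semantics being discussed can be illustrated in plain Python. This is only a sketch of the expected behavior (the function name and row representation are illustrative); a real implementation would use engine primitives such as window functions rather than materializing rows:

```python
def forward_fill(rows, order_key, column):
    """Replace None values in `column` with the last non-null value,
    after ordering rows by the explicit `order_key` column."""
    filled_rows = []
    last_seen = None
    for row in sorted(rows, key=lambda r: r[order_key]):
        row = dict(row)  # copy so the input is not mutated
        if row[column] is None:
            row[column] = last_seen
        else:
            last_seen = row[column]
        filled_rows.append(row)
    return filled_rows
```

Note that without the explicit `order_key`, the result would depend on arbitrary row order, which is exactly the concern about sorting every field in the schema independently.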
```python
result = df.interpolate("forward_fill")

assert isinstance(result, DataFrame)
```
We would probably want to collect the results and verify that the values are filled as expected, not just check the return type.
Thanks for the feedback!
Addresses items from RFC: Re-work some DataFrame APIs #875
This is my first PR! Please let me know what I can improve.